Abstract

Clustering problems have numerous applications and are becoming more challenging as the size
of the data increases. In this paper, we consider designing clustering algorithms that can be used in
MapReduce, the most popular programming environment for processing large datasets. We focus on the
practical and popular clustering problems, k-center and k-median. We develop fast clustering algorithms
with constant factor approximation guarantees. From a theoretical perspective, we give the first analysis
that shows several clustering algorithms are in $\mathcal{MRC}^0$, a theoretical MapReduce class introduced by
Karloff et al. [26]. Our algorithms use sampling to decrease the data size and they run a time consuming
clustering algorithm such as local search or Lloyd's algorithm on the resulting data set. Our algorithms
have sufficient flexibility to be used in practice since they run in a constant number of MapReduce
rounds. We complement these results by performing experiments using our algorithms. We compare the
empirical performance of our algorithms to several sequential and parallel algorithms for the k-median
problem. The experiments show that our algorithms' solutions are similar to or better than the other
algorithms' solutions. Furthermore, on data sets that are sufficiently large, our algorithms are faster than
the other parallel algorithms that we tested.

1 Introduction

Clustering data is a fundamental problem in a variety of areas of computer science and related fields.
Machine learning, data mining, pattern recognition, networking, and bioinformatics use clustering for data
analysis. Consequently, there is a vast amount of research focused on the topic.
In the clustering problems that we consider in this paper, the goal is to partition the data into subsets, called
clusters, such that the data points assigned to the same cluster are similar according to some metric.



In several applications, it is of interest to classify or group web pages according to their content or
cluster users based on their online behavior. One such example is finding communities in social networks.
Communities consist of individuals that are closely related according to some relationship criteria. Finding
these communities is of interest for applications such as predicting buying behavior or designing targeted
marketing plans and is an ideal application for clustering. However, the size of the web graph and social
network graphs can be quite large; for instance, the web graph consists of a trillion edges [30]. When the
amount of data is this large, it is difficult or even impossible for the data to be stored on a single machine,
which renders sequential algorithms unusable. In situations where the amount of data is prohibitively large,
the MapReduce [16] programming paradigm is used to overcome this obstacle. MapReduce and its open
source counterpart Hadoop [33] are distributed computing frameworks designed to process massive data
sets.

The MapReduce model is quite novel, since it interleaves sequential and parallel computation.
Succinctly, MapReduce consists of several rounds of computation. There is a set of machines, each of which
has a certain amount of memory available. The memory on each machine is limited, and there is no
communication between the machines during a round. In each round, the data is distributed among the machines.
The data assigned to a single machine is constrained to be sub-linear in the input size. This restriction is
motivated by the fact that the input size is assumed to be very large [26, 15]. After the data is distributed,
each of the machines performs some computation on the data that is available to it. The output of these
computations is either the final result or it becomes the input of another MapReduce round. A more precise
overview of the MapReduce model is given in Section 1.1.

Problems: In this paper, we are concerned with designing clustering algorithms that can be implemented
using MapReduce. In particular, we focus on two well-studied problems: metric k-median and k-center. In
both of these problems, we are given a set V of n points, together with the distances between any pair of
points; we give a precise description of the input representation below. The goal is to choose k of the points.
Each of the k chosen points represents a cluster and is referred to as a center. Every data point is assigned
to the closest center and all of the points assigned to a given center form a cluster. In the k-center problem,
the goal is to choose the centers such that the maximum distance between a center and a point assigned
to it is minimized. In the k-median problem the objective is to minimize the sum of the distances from
the centers to each of the points assigned to the centers. Both of the problems are known to be NP-Hard.
Thus previous work has focused on finding approximation algorithms [23, 5, 12, 11, 4, 21, 8]. Many of the
existing algorithms are inherently sequential and, with the exception of the algorithms of [8, 20], they are
difficult to adapt to a parallel computing setting. We discuss the algorithms of [20] in more detail later.
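
To make the two objectives concrete, the following is a minimal Python sketch that evaluates both costs for a given set of centers. The function names and the use of Euclidean distance in the plane are assumptions made for illustration; the paper only requires that the distances form a metric.

```python
from typing import Callable, Sequence, Tuple

Point = Tuple[float, float]

def euclidean(p: Point, q: Point) -> float:
    """Euclidean distance in the plane (an illustrative choice of metric)."""
    return ((p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2) ** 0.5

def dist_to_set(x: Point, centers: Sequence[Point],
                d: Callable[[Point, Point], float]) -> float:
    """Distance from x to its closest center."""
    return min(d(x, c) for c in centers)

def k_center_cost(points, centers, d=euclidean) -> float:
    """k-center objective: the largest point-to-closest-center distance."""
    return max(dist_to_set(x, centers, d) for x in points)

def k_median_cost(points, centers, d=euclidean) -> float:
    """k-median objective: the sum of point-to-closest-center distances."""
    return sum(dist_to_set(x, centers, d) for x in points)

points = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 3.0)]
centers = [(0.0, 0.0), (10.0, 0.0)]
print(k_center_cost(points, centers))  # 3.0
print(k_median_cost(points, centers))  # 4.0
```

The same assignment of points to centers is used in both cases; only the aggregation (maximum versus sum) differs, which is why the k-center objective is far more sensitive to a single badly served point.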

Input Representation: Let $d : V \times V \to \mathbb{R}_+$ denote the distance function. The distance function $d$ is a
metric, i.e., it has the following properties: (1) $d(x, y) = 0$ if and only if $x = y$, (2) $d(x, y) = d(y, x)$ for all
$x, y$, and (3) $d(x, z) \le d(x, y) + d(y, z)$ for all $x, y, z$. The third property is called the triangle inequality; we
note that our algorithms only rely on the fact that the distances between points satisfy the triangle inequality.
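
The three properties can be checked by brute force on a small finite point set. The sketch below is only an illustration; the function name `is_metric` and the tolerance parameter are assumptions, not part of the paper.

```python
from itertools import product

def is_metric(points, d, tol=1e-12):
    """Brute-force check of the three metric properties on a finite set."""
    for x, y in product(points, repeat=2):
        # (1) d(x, y) = 0 if and only if x = y
        if (d(x, y) <= tol) != (x == y):
            return False
        # (2) symmetry
        if abs(d(x, y) - d(y, x)) > tol:
            return False
    for x, y, z in product(points, repeat=3):
        # (3) triangle inequality
        if d(x, z) > d(x, y) + d(y, z) + tol:
            return False
    return True

pts = [0, 1, 2, 5]
print(is_metric(pts, lambda a, b: abs(a - b)))    # True
print(is_metric(pts, lambda a, b: (a - b) ** 2))  # False
```

Squared distance fails the check because of the triangle inequality: the squared distance from 0 to 2 is 4, which exceeds 1 + 1.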

Now we discuss how the distance function is given to our algorithms. In some settings, the distance
function has an implicit compact representation; for example, if the distances between points are shortest
path distances in a sparse graph, the graph represents the distance function compactly. However, currently
there does not exist a MapReduce algorithm that computes shortest paths in a constant number of rounds,
even if the graph is unweighted. This motivates the assumption that we are given the distance function
explicitly as a set of $O(n^2)$ distances, one for each pair of points, or we are given access to an oracle that
takes as input two points and returns the distance between them. Throughout this paper, we assume that the
distance function is given explicitly. More precisely, we assume that the input is a weighted complete graph
$G = (V, E)$ that has an edge $xy$ between any two points in $V$, and the weight of the edge $xy$ is $d(x, y)$.$^1$
Moreover, we assume that $k$ is at most $O(n^{1-\delta})$ for some constant $\delta > 0$, and the distance between any pair
of points is upper bounded by some polynomial in $n$. These assumptions are justified in part by the fact that
the number of points is very large, and by the memory constraints of the MapReduce model; we discuss the
MapReduce model in more detail in Section 1.1.

1 We note that some of the techniques in this paper extend to the setting in which the distance function is given as an oracle.

Contributions: We introduce the first approximate metric k-median and k-center algorithms designed to
run on MapReduce. More precisely, we show the following results.

Theorem 1.1. There is a randomized constant approximation algorithm for the k-center problem that,
with high probability, runs in $O(1/\delta)$ MapReduce rounds and uses memory at most $O(k^2 n^\delta)$ on each of the
machines for any constant $\delta > 0$.

Theorem 1.2. There is a randomized constant approximation algorithm for the k-median problem that,
with high probability, runs in $O(1/\delta)$ MapReduce rounds and uses memory at most $O(k^2 n^\delta)$ on each of the
machines for any constant $\delta > 0$.

To complement these results, we run our algorithms on randomly generated data sets. For the k-median
problem we compare our algorithm to a parallelized implementation of Lloyd's algorithm [28],
arguably the most popular clustering algorithm used in practice (see [2, 22] for example), the local search
algorithm [4, 21], the best known approximation algorithm for the k-median problem, and a partitioning
based algorithm that can parallelize any sequential clustering algorithm (see Section 4). Our algorithms
achieve a speed-up of 1000x over the local search algorithm and 20x over the parallelized Lloyd's
algorithm, a significant improvement in running time. Further, our algorithm's objective is similar to that of
Lloyd's algorithm and the local search algorithm. For the partitioning based algorithm, we show that our algorithm
achieves faster running time when the number of points is large. Thus for the k-median problem our
algorithms are fast with a small loss in performance. For the k-center problem we compare our algorithm to the
well known algorithm of [17, 19], which is the best approximation algorithm for the problem and is quite
efficient. Unfortunately, for the k-center problem our algorithm's objective is a factor four worse in some
cases. This is due to the sensitivity of the k-center objective to sampling.

Our algorithms show that the k-center and k-median problems belong to the theoretical MapReduce class
$\mathcal{MRC}^0$ that was introduced by Karloff et al. [26].$^2$ Let $N$ denote the total size of the input, and let $0 < \epsilon < 1$
be a fixed constant. A problem is in the MapReduce class $\mathcal{MRC}^0$ if it can be solved using a constant number
of rounds and an $O(N^{1-\epsilon})$ number of machines, where each machine has $O(N^{1-\epsilon})$ memory available [26].
Differently said, the problem has an algorithm that uses a sub-linear amount of memory on each machine
and a sub-linear number of machines. One of the main motivations for these restrictions is that a typical
MapReduce input is very large and it might not be possible to store the entire input on a single machine.
Moreover, the size of the input might be much larger than the number of machines available. We discuss
the theoretical MapReduce model in Section 1.1. Our assumptions on the size of $k$ and the point distances
are needed in order to show that the memory that our algorithms use on each machine is sub-linear in the
total input size. For instance, without the assumption on $k$, we will not be able to fit $k$ points in the memory
available on a machine.

Adapting Existing Algorithms to MapReduce: Previous work on designing algorithms for MapReduce
is generally based on the following approach. Partition the input and assign each partition to a unique
machine. On each machine, we perform some computation that eliminates a large fraction of the input.
We collect the results of these computations on a single machine, which can store the data since the data
has been sparsified. On this machine, we perform some computation and we return the final solution. We
can use a similar approach for the k-center and k-median problems. We partition the points across the
machines. We cluster each of the partitions. We select one point from each cluster and put all of the selected
points on a single machine. We cluster these points and output the solution. Indeed, a similar algorithm
was considered by Guha et al. [20] for the k-median problem in the streaming model. We give the details
of how to implement this algorithm in MapReduce in Section 4 along with an analysis of the algorithm's
approximation guarantees. Unfortunately, the total running time for the algorithm can be quite large, since
it runs a costly clustering algorithm on $\Omega(k\sqrt{n/k})$ points. Further, this algorithm requires $\Omega(kn)$ memory
on each of the machines.

2 Recall that we only consider instances of the problems in which k is sub-linear in the number of points, and the distances
between points are upper bounded by some polynomial in n.

Another strategy for developing algorithms for k-center and k-median that run in MapReduce is to try
to adapt existing parallel algorithms. To the best of our knowledge, the only parallel algorithms known with
provable guarantees were given by Blelloch and Tangwongsan [8]; Blelloch and Tangwongsan [8] give the
first PRAM algorithms for k-center and k-median. Unfortunately, these algorithms assume that the number
of machines available is $\Omega(N^2)$, where $N$ is the total input size, and that there is some memory available in the
system that can be accessed by all of the machines. These assumptions are too strong for the algorithms to
be used in MapReduce. Indeed, the requirements that the machines have a limited amount of memory
and that there is no communication between the machines are what differentiate the MapReduce model from
standard parallel computing models. Another approach is to try to adapt algorithms that were designed
for the streaming model. Guha et al. [20] have given a k-median algorithm for the streaming model; with
some work, we can adapt one of the algorithms in [20] to the MapReduce model. However, this algorithm's
approximation ratio degrades exponentially in the number of rounds.

Related Work: There has been a large amount of work on the metric k-median and k-center problems.
Due to space constraints, we focus only on closely related work that we have not already mentioned. Both
problems are known to be NP-Hard. Bartal [6, 7] gave an algorithm for the k-median problem that achieves
an $O(\log n \log\log n)$ approximation ratio. Later Charikar et al. gave the first constant factor approximation
of $6\frac{2}{3}$ [12]. This approach was based on LP rounding techniques. The best known approximation algorithm
achieves a $3 + \frac{2}{c}$ approximation in $O(n^c)$ time [4, 21]; this algorithm is based on the local search technique.
On the other hand, Jain et al. [24] have shown that there does not exist a $1 + (2/e)$ approximation for the
k-median problem unless $NP \subseteq DTIME(n^{O(\log\log n)})$. For the k-center problem, two simple algorithms
are known which achieve a 2-approximation [23, 17, 19] and this approximation ratio is tight assuming that
$P \neq NP$.

MapReduce has received a significant amount of attention recently. Most previous work has been on
designing practical heuristics to solve large scale problems [25, 27]. Recent papers [26, 18] have focused
on developing computational models that abstract the power and limitations of MapReduce. Finally, there
has been work on developing algorithms and approximation algorithms that fit into the MapReduce model
[26, 15]. This line of work has shown that problems such as minimum spanning tree, maximum coverage,
and connectivity can be solved efficiently using MapReduce.

1.1 MapReduce Overview 

In this section we give a high-level overview of the MapReduce model; for a more detailed description, see
[26]. The data is represented as (key; value) pairs. The key acts as an address of the machine to which the
value needs to be sent. A MapReduce round consists of three stages: map, shuffle, and reduce. The map
phase processes the data as follows. The algorithm designer specifies a map function $\mu$, which we refer to
as a mapper. The mapper takes as input a (key; value) pair, and it outputs a sequence of (key; value) pairs.
Intuitively, the mapper maps the data stored in the (key; value) pair to a machine. In the map phase, the
map function is applied to all (key; value) pairs. In the shuffle phase, all (key; value) pairs with a given
key are sent to the same machine; this is done automatically by the underlying system. The reduce phase
processes the (key; value) pairs created in the map phase as follows. The algorithm designer specifies a
reduce function $\rho$, which we refer to as a reducer. The reducer takes as input all the (key; value) pairs
that have the same key, and it outputs a sequence of (key; value) pairs which have the same key as the
input pairs; these pairs are either the final output, or they become the input of the next MapReduce round.
Intuitively, the reducer performs some sequential computation on all the data that is stored on a machine.
The mappers and reducers are constrained to run in time that is polynomial in the size of the initial input,
and not their input.
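
The three stages of a round can be simulated on a single machine. The following Python sketch is a toy, in-memory illustration of the map, shuffle, and reduce stages described above; it is not how Hadoop or any real system is implemented, and the names `mapreduce_round`, `route_by_parity`, and `sum_values` are hypothetical.

```python
from collections import defaultdict

def mapreduce_round(pairs, mapper, reducer):
    """Simulate one MapReduce round on a list of (key, value) pairs."""
    # Map: apply the mapper to every (key, value) pair.
    mapped = []
    for key, value in pairs:
        mapped.extend(mapper(key, value))
    # Shuffle: all pairs with the same key go to the same "machine".
    groups = defaultdict(list)
    for key, value in mapped:
        groups[key].append(value)
    # Reduce: each reducer sees every value that shares its key.
    output = []
    for key, values in groups.items():
        output.extend(reducer(key, values))
    return output

def route_by_parity(key, value):
    # The new key acts as the address of the receiving machine.
    yield (value % 2, value)

def sum_values(key, values):
    # Output pairs keep the key of the input pairs.
    yield (key, sum(values))

pairs = [(i, i) for i in range(6)]
result = mapreduce_round(pairs, route_by_parity, sum_values)
print(sorted(result))  # [(0, 6), (1, 9)]
```

The output of one call can be passed as the `pairs` argument of the next call, which mirrors how the output of one round becomes the input of the next.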

The theoretical $\mathcal{MRC}^0$ class was introduced in [26]. The class is designed to capture the practical
restrictions of MapReduce as faithfully as possible; a detailed justification of the model can be found in [26].
In addition to the constraints on the mappers and reducers, there are three types of restrictions in $\mathcal{MRC}^0$:
constraints on the number of machines used, on the memory available on each of the machines, and on the
number of rounds of computation. If the input to a problem is of size $N$ then an algorithm is in $\mathcal{MRC}^0$ if
it uses at most $N^{1-\epsilon}$ machines, each with at most $N^{1-\epsilon}$ memory for some constant $\epsilon > 0$.$^3$ Notice that
this implies that the total memory available is $O(N^{2-2\epsilon})$. Thus the difficulty of designing algorithms for
the MapReduce model does not come from the lack of total memory. Rather, it stems from the fact that the
memory available on each machine is limited; in particular, the entire input does not fit on a single machine.
Not allowing the entire input to be placed on a single machine makes designing algorithms difficult, since a
machine is only aware of a subset of the input. Indeed, because of this restriction, it is currently not known
whether fundamental graph problems such as shortest paths or maximum matchings can be computed in a
constant number of rounds, even if the graphs are unweighted.

In the following, we state the precise restrictions on the resources available to an algorithm for a problem
in the class $\mathcal{MRC}^0$.

• Memory: The total memory used on a specific machine is at most $O(N^{1-\epsilon})$.

• Machines: The total number of machines used is $O(N^{1-\epsilon})$.

• Rounds: The number of rounds is constant.
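
As a worked instance of these bounds, take $N = 10^{12}$ and $\epsilon = 1/2$. The hidden constants in the $O(\cdot)$ notation are taken to be 1 here, which is an assumption made only for concreteness.

```latex
\begin{align*}
\text{memory per machine} &\le N^{1-\epsilon} = (10^{12})^{1/2} = 10^{6},\\
\text{number of machines} &\le N^{1-\epsilon} = 10^{6},\\
\text{total memory} &\le N^{1-\epsilon} \cdot N^{1-\epsilon} = N^{2-2\epsilon} = 10^{12} = N.
\end{align*}
```

With $\epsilon = 1/2$ the total memory is just enough to hold the input once; any smaller $\epsilon$ leaves slack in total memory while still keeping each machine's share sub-linear in $N$.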

2 Algorithms 

In this section we describe our clustering algorithms MapReduce-kCenter and
MapReduce-kMedian. For both of our algorithms, we will parameterize the amount of memory
needed on a machine. For the MapReduce setting, the amount of memory our algorithms require on each
of the machines is parameterized by $\delta > 0$ and we assume that the memory is $\Omega(k^2 n^\delta)$. It is further
assumed that the number of machines is large enough to store all of the input data across the machines.
Both algorithms use Iterative-Sample as a subroutine which uses sampling ideas from [32]. The
role of Iterative-Sample is to get a substantially smaller subset of points that represents all of the
points well. To achieve this, Iterative-Sample performs the following computation iteratively: in
each iteration, it adds a small sample of points to the final sample, it determines which points are "well
represented" by the sample, and it recursively considers only the points that are not well represented. More
precisely, after sampling, Iterative-Sample discards most points that are close to the current sample,
and it recurses on the remaining (unsampled) points. The algorithm repeats this procedure until the number
of points that are still unrepresented is small and all such points are added to the sample. Once we have
a good sample, we run a clustering algorithm on just the sampled points. Knowing that the sample
represents all unsampled points well, a good clustering of the sampled points will also be a good clustering
of all of the points. Here the clustering algorithm used will depend on the problem considered. In the
following section, we show how Iterative-Sample can be implemented in the sequential setting to
highlight the high level ideas. Then we show how to extend the algorithms to the MapReduce setting.

3 The algorithm designer can choose $\epsilon$.

2.1 Sampling Sequentially

In this section, the sequential version of the sampling algorithm is discussed. When we mention the distance
of a point $x$ to a set $S$, we mean the minimum distance between $x$ and any point in $S$. Our algorithm is
parameterized by a constant $0 < \epsilon < \frac{\delta}{2}$ whose value can be changed depending on the system specifications.
Simply put, the value of $\epsilon$ determines the sample size. For each of our algorithms there is a natural trade-off
between the sample size and the running time of the algorithm.

Algorithm 1 Iterative-Sample($V$, $E$, $k$, $\epsilon$):
1: Set $S \leftarrow \emptyset$, $R \leftarrow V$.
2: while $|R| > \frac{4}{\epsilon} k n^\epsilon \log n$ do
3:    Add each point in $R$ with probability $\frac{9 k n^\epsilon}{|R|} \log n$ independently to $S$.
4:    Add each point in $R$ with probability $\frac{4 n^\epsilon}{|R|} \log n$ independently to $H$.
5:    $v \leftarrow$ Select($H$, $S$)
6:    Find the distance of each point $x \in R$ to $S$. Remove $x$ from $R$ if this distance is smaller than the
      distance of $v$ to $S$.
7: end while
8: Output $C := S \cup R$


Algorithm 2 Select($H$, $S$):
1: For each point $x \in H$, find the distance of $x$ to $S$.
2: Order the points in $H$ according to their distance to $S$ from farthest to closest.
3: Let $v$ be the point that is in the $(8 \log n)$-th position in the ordering.
4: Return $v$.



The algorithm Iterative-Sample maintains a set of sampled points S and a set of points R that 
contains the points that are not well represented by the current sample. The algorithm repeatedly adds
new points to the sample. By adding more points to the sample, S will represent more points well. More 
points are added to S until the number of remaining points decreases below the threshold given in line 2. 
The point v chosen in line 5 serves as the pivot to determine which points are well represented: if a point x 
is closer to the sample S than the pivot v, the point x is considered to be well represented by S and dropped 
from R. Finally, Iterative-Sample returns the union of S and R. Note that R must be returned since 
R is not well represented by S even at the end of the while loop. 
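
The sequential procedure can be sketched in Python as follows. This is a minimal illustration under several assumptions that are not fixed by the text: logarithms are natural, the loop threshold is taken to be $(4/\epsilon) k n^\epsilon \log n$ and the sampling probabilities $9 k n^\epsilon \log n / |R|$ and $4 n^\epsilon \log n / |R|$ (clamped to 1), the pivot position is clamped to the length of $H$, points are distinct and hashable, and newly sampled points are dropped from $R$ explicitly so that the loop always makes progress. The last safeguard does not change the returned set $S \cup R$ in any single iteration, since the dropped points are already in $S$.

```python
import math
import random

def dist_to_set(x, S, d):
    """Minimum distance from x to any point of S (infinity if S is empty)."""
    return min((d(x, s) for s in S), default=math.inf)

def select(H, S, d, n):
    """Pivot: the point in position ceil(8 log n) when H is ordered by
    distance to S, farthest first (clamped to the last position)."""
    ranked = sorted(H, key=lambda x: dist_to_set(x, S, d), reverse=True)
    pos = min(len(ranked), math.ceil(8 * math.log(n)))
    return ranked[pos - 1]

def iterative_sample(V, k, eps, d, rng=None):
    """Return a sample C = S + R of the points in V."""
    rng = rng or random.Random(0)
    n = len(V)
    if n < 2:
        return list(V)
    scale = n ** eps * math.log(n)
    S, R = [], list(V)
    while len(R) > (4 / eps) * k * scale:
        p_s = min(1.0, 9 * k * scale / len(R))
        p_h = min(1.0, 4 * scale / len(R))
        new = [x for x in R if rng.random() < p_s]
        H = [x for x in R if rng.random() < p_h]
        if not new or not H:
            continue  # resample in the unlikely event of an empty draw
        S.extend(new)
        v = select(H, S, d, n)
        cutoff = dist_to_set(v, S, d)
        sampled = set(new)
        # Keep only points that are not closer to S than the pivot is.
        R = [x for x in R
             if x not in sampled and dist_to_set(x, S, d) >= cutoff]
    return S + R

d = lambda a, b: abs(a - b)
C = iterative_sample(list(range(500)), k=1, eps=0.1, d=d)
print(len(C) <= 500)  # True
```

On inputs this small the threshold in line 2 is close to $n$, so little or nothing is discarded; the reduction by a factor of roughly $n^\epsilon$ per iteration only becomes visible when $n$ is large.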

2.2 MapReduce Algorithms 

First we show a MapReduce version of Iterative-Sample and then we give MapReduce algorithms
for the k-center and k-median problems. For these algorithms we assume that for any set $S$ and parameter
$\eta$, the set $S$ can be arbitrarily partitioned into sets of size $|S|/\eta$ by the mappers. To see that this is the case,
we refer the reader to [26].

Algorithm 3 MapReduce-Iterative-Sample($V$, $E$, $k$, $\epsilon$):
1: Set $S \leftarrow \emptyset$, $R \leftarrow V$.
2: while $|R| > \frac{4}{\epsilon} k n^\epsilon \log n$ do
3:    The mappers arbitrarily partition $R$ into sets of size at most $\lceil n^\epsilon \rceil$. Each of these sets is
      mapped to a unique reducer.
4:    For a reducer $i$, let $R^i$ denote the points assigned to the reducer. Reducer $i$ adds each point in $R^i$
      to a set $S^i$ independently with probability $\frac{9 k n^\epsilon}{|R|} \log n$ and also adds each point in $R^i$ to a set $H^i$
      independently with probability $\frac{4 n^\epsilon}{|R|} \log n$.
5:    Let $H := \bigcup_i H^i$ and $S := S \cup (\bigcup_i S^i)$, where the unions are over all reducers $i$. The mappers assign $H$ and $S$ to a single
      machine along with all edge distances from each point in $H$ to each point in $S$.
6:    The reducer whose input is $H$ and $S$ sets $v \leftarrow$ Select($H$, $S$).
7:    The mappers arbitrarily partition the points in $R$ into $\lceil n^{1-\epsilon} \rceil$ subsets, each of size at most $\lceil |R| / n^{1-\epsilon} \rceil$.
      Let $R^i$ for $1 \le i \le \lceil n^{1-\epsilon} \rceil$ denote these sets. Let $v$, $R^i$, $S$, and the distances between each point in $R^i$
      and each point in $S$ be assigned to reducer $i$.
8:    Reducer $i$ finds the distance of each point $x \in R^i$ to $S$. The point $x$ is removed from $R^i$ if this
      distance is smaller than the distance of $v$ to $S$.
9:    Let $R := \bigcup_i R^i$.
10: end while
11: Output $C := S \cup R$

The following propositions give the theoretical guarantees of the algorithm; these propositions can also
serve as a guide for choosing an appropriate value for the parameter $\epsilon$. If the probability of an event is
$1 - O(1/n)$, we say that the event occurs with high probability, which we abbreviate as w.h.p. The first
two propositions follow from the fact that, w.h.p., each iteration of Iterative-Sample decreases the
number of remaining points (i.e., the size of the set $R$) by a factor of $\Theta(n^\epsilon)$. We give the proofs of these
propositions in the next section. Note that the propositions imply that our algorithm belongs to $\mathcal{MRC}^0$.

Proposition 2.1. The number of iterations of the while loop of Iterative-Sample is at most $O(\frac{1}{\epsilon})$
w.h.p.

Proposition 2.2. The set returned by Iterative-Sample has size $O(\frac{1}{\epsilon} k n^\epsilon \log n)$ w.h.p.

Proposition 2.3. MapReduce-Iterative-Sample is a MapReduce algorithm that requires $O(\frac{1}{\epsilon})$
rounds when machines have memory $O(k n^\delta)$ for a constant $\delta > 2\epsilon$ w.h.p.

Proof. Consider a single iteration of the while loop. Each iteration takes a constant number of MapReduce
rounds. By Proposition 2.1, the number of iterations of this loop is $O(\frac{1}{\epsilon})$, and therefore the number of
rounds is $O(\frac{1}{\epsilon})$. The memory needed on a machine is dominated by the memory required by Step 8. The
size of $S$ is $O(\frac{1}{\epsilon} k n^\epsilon \log n)$ by Proposition 2.2. Further, the size of $R^i$ is at most $n / n^{1-\epsilon} = n^\epsilon$. Let $\eta$ be the
maximum number of bits needed to represent the distance from one point to another. Thus the total memory
needed on a machine is $O(\frac{1}{\epsilon} k n^\epsilon \log n \cdot n^\epsilon \cdot \eta)$, the memory needed to store the distances from points in $R^i$
to points in $S$. By assumption $\eta = O(\log n)$, thus the total memory needed on a machine is upper bounded
by $O(\frac{1}{\epsilon} k n^{2\epsilon} \log^2 n)$. By setting $\delta$ to be a constant slightly larger than $2\epsilon$, the proposition follows. □

Once we have this sampling algorithm, our algorithm MapReduce-kCenter for the k-center problem
is fairly straightforward. This is the algorithm considered in Theorem 1.1. The memory needed by the
algorithm is dominated by storing the pairwise distances between points in $C$ on a single machine. By
Proposition 2.2 and the assumption that the maximum distance between any two points can be represented
using $O(\log n)$ bits, w.h.p. the memory needed is $O((\frac{1}{\epsilon} k n^\epsilon \log n)^2 \cdot \log n) = O(n^\delta k^2)$, where $\delta$ is a constant
greater than $2\epsilon$.

Algorithm 4 MapReduce-kCenter($V$, $E$, $k$, $\epsilon$):
1: Let $C \leftarrow$ Iterative-Sample($V$, $E$, $k$, $\epsilon$).
2: Map $C$ and all of the pairwise distances between points in $C$ to a reducer.
3: The reducer runs a k-center clustering algorithm $\mathcal{A}$ on $C$.
4: Return the set constructed by $\mathcal{A}$.
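
Algorithm 4 leaves the choice of the k-center algorithm open. One possible choice, shown here purely as an illustration, is the classic greedy farthest-first traversal, a standard 2-approximation for metric k-center. The function name `farthest_first` and the choice of the first point as the initial center are assumptions of this sketch, not details taken from the paper.

```python
def farthest_first(points, k, d):
    """Greedy farthest-first traversal on a non-empty list of points.

    Starts from points[0] and repeatedly adds the point that is
    farthest from the centers chosen so far.
    """
    centers = [points[0]]
    # dist[j] = distance from points[j] to its closest chosen center
    dist = [d(p, centers[0]) for p in points]
    while len(centers) < min(k, len(points)):
        i = max(range(len(points)), key=dist.__getitem__)
        centers.append(points[i])
        dist = [min(dist[j], d(points[j], points[i]))
                for j in range(len(points))]
    return centers

sample = [0, 1, 2, 10, 11, 20]
centers = farthest_first(sample, 3, lambda a, b: abs(a - b))
print(centers)  # [0, 20, 10]
```

In the setting of Algorithm 4 this routine would run on the single reducer that holds the sample $C$ and its pairwise distances, so its running time depends on $|C|$ rather than on $n$.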



However, for the k-median problem, the sample must contain more information than just the set of
sampled points. This is because the k-median objective considers the sum of the distances to the centers. To
ensure that we can map a good solution for the points in the sample to a good solution for all of the points,
for each unsampled point $x$, we select the sampled point that is closest to $x$ (if there are several points that
are closest to $x$, we pick one arbitrarily). Additionally, we assign a weight to each sampled point $y$ that is
equal to the number of unsampled points that picked $y$ as their closest point. This is done so that, when we
cluster the sampled points on a single machine, we can take into account the effect of the unsampled points
on the objective. For a point $x$ and a set of points $A$, let $d(x, A)$ denote the minimum distance from the
point $x \in V$ to a point in $A$, i.e., $d(x, A) = \min_{y \in A} d(x, y)$. The algorithm MapReduce-kMedian is the
following.

Algorithm 5 MapReduce-kMedian($V$, $E$, $k$, $\epsilon$):
1: Let $C \leftarrow$ MapReduce-Iterative-Sample($V$, $E$, $k$, $\epsilon$).
2: The mappers arbitrarily partition $V$ into $\lceil n^{1-\epsilon} \rceil$ sets of size at most $\lceil n^\epsilon \rceil$. Let $V^i$ for $1 \le i \le \lceil n^{1-\epsilon} \rceil$ be
   the partitioning.
3: The mappers assign $V^i$, $C$ and all distances between points in $V^i$ and $C$ to reducer $i$ for all $1 \le i \le \lceil n^{1-\epsilon} \rceil$.
4: Each reducer $i$, for each $y \in C$, computes $w^i(y) = |\{x \in V^i \setminus C \mid d(x, y) = d(x, C)\}|$.
5: Map all of the weights $w^i(\cdot)$, $C$ and the pairwise distances between all points in $C$ to a single reducer.
6: The reducer computes $w(y) = \sum_{i \in [m]} w^i(y) + 1$ for all $y \in C$.
7: The reducer runs a weighted k-median clustering algorithm $\mathcal{A}$ on that machine with $(C, w, k)$ as input.
8: Return the set constructed by $\mathcal{A}$.
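
The weighting in steps 4 and 6 can be sketched as follows. This is an illustration under stated assumptions: the function names are hypothetical, the partitions are plain Python lists, and each unsampled point is charged to exactly one closest sample point (ties are broken by the order of $C$), following the description in the text rather than counting a tied point once for every closest sample point.

```python
from collections import Counter

def partition_weights(part, C, d):
    """Per-reducer weights: for each y in C, count the points of this
    partition, outside C, whose closest sample point is y."""
    in_C = set(C)
    w = Counter()
    for x in part:
        if x in in_C:
            continue
        y = min(C, key=lambda c: d(x, c))  # ties broken by order in C
        w[y] += 1
    return w

def sample_weights(parts, C, d):
    """Combine the per-reducer weights; the initial 1 counts y itself."""
    total = Counter({y: 1 for y in C})
    for part in parts:
        total.update(partition_weights(part, C, d))
    return dict(total)

C = [0, 10]
parts = [[0, 1, 2], [8, 10, 11], [4]]
w = sample_weights(parts, C, lambda a, b: abs(a - b))
print(w)  # {0: 4, 10: 3}
```

With this tie-breaking the weights sum to the total number of points (here 7), so the weighted instance on $C$ accounts for every input point exactly once.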



The MapReduce-kMedian algorithm performs additional rounds to give a weight to each point in the
sample $C$. We remark that these additional rounds can be easily removed by gradually performing this
operation in each iteration of MapReduce-Iterative-Sample. The maximum memory used by a machine
in MapReduce-kMedian is bounded similarly to that of MapReduce-kCenter. The proofs of all propositions
and theorems will be given in the next section. The algorithm MapReduce-kMedian is the algorithm
considered in Theorem 1.2. Notice that both MapReduce-kMedian and MapReduce-kCenter use some
clustering algorithm as a subroutine. The running times of these clustering algorithms depend on the size of
the sample and therefore there is a trade-off between the running times of these algorithms and the number
of MapReduce rounds.

3 Analysis



3.1 Subroutine: Iterative-Sample 

This section is devoted to the analysis of Iterative-Sample, the main subroutine of our clustering
algorithms. Before we give the analysis, we introduce some notation. Let $S^*$ denote any set. We will show
several lemmas and theorems that hold for any set $S^*$, and in the final step, we will set $S^*$ to be the optimal
set of centers. The reader may read the lemmas and theorems assuming that $S^*$ is the optimal set of centers.
We assign each point $x \in V$ to its closest point in $S^*$, breaking ties arbitrarily but consistently. Let $x^{S^*}$ be
the point in $S^*$ to which $x$ is assigned; if $x$ is in $S^*$, we assign $x$ to itself. Let $S^*(x)$ be the set of all points
assigned to $x \in S^*$.

We say that a point $x$ is satisfied by $S$ with respect to $S^*$ if $d(S, x^{S^*}) \le d(x, x^{S^*})$. If $S$ and $S^*$ are
clear from the context, we will simply say that $x$ is satisfied. We say that $x$ is unsatisfied if it is not satisfied.
Throughout the analysis, for any point $x$ in $V$ and any subset $S \subseteq V$, we will let $x^S$ denote the point in $S$
that is closest to $x$.

We now explain the intuition behind the definition of "satisfied". Our sampling subroutine's output 
$C$ may not include each center in $S^*$. However, a point $x$ could be "satisfied", even though $x^{S^*} \notin C$,
by including a point in $C$ that is closer to $x$ than $x^{S^*}$. Intuitively, if all points are satisfied, our sampling
algorithm returned a very representative sample of all points, and our clustering algorithms will perform 
well. However, we cannot guarantee that all points are satisfied. Instead, we will show that the number of 
unsatisfied points is small and their contribution to the clustering cost is negligible compared to the satisfied 
points' contribution. This will allow us to upper bound the distance between the unsatisfied points and the 
final solution constructed by our algorithm by the cost of the optimal solution. 

Since the sets described in Iterative-Sample change in each iteration, for the purpose of the
analysis, we let $R_\ell$, $S_\ell$, and $H_\ell$ denote the sets $R$, $S$, and $H$ at the beginning of iteration $\ell$. Note that
$R_1 = V$ and $S_1 = \emptyset$. Let $D_\ell$ denote the set of points that are removed (deleted) during iteration $\ell$. Note that
$R_{\ell+1} = R_\ell \setminus D_\ell$. Let $U_\ell$ denote the set of points in $R_\ell$ that are not satisfied by $S_{\ell+1}$ with respect to $S^*$. Let
$C$ denote the set of points that Iterative-Sample returns. Let $U$ denote the set of all points that are unsatisfied
by $C$ with respect to $S^*$. If a point is satisfied by $S_\ell$ with respect to $S^*$ then it is also satisfied by $C$ with
respect to $S^*$, and therefore $U \subseteq \bigcup_{\ell \ge 1} U_\ell$.

We start by upper bounding $|U_\ell|$, the number of unsatisfied points at the end of iteration $\ell$.

Lemma 3.1. Let $S^*$ be any set with no more than $k$ points. Consider iteration $\ell$ of Iterative-Sample,
where $\ell \ge 1$. Then

$$\Pr\left[ |U_\ell| \ge \frac{|R_\ell|}{3 n^\epsilon} \right] \le \frac{1}{n^2}.$$

Proof. Consider any point $y$ in $S^*$. Recall that $S^*(y)$ denotes the set of all points that are assigned to $y$.
Note that it suffices to show that

$$\Pr\left[ |U_\ell \cap S^*(y) \cap R_\ell| \ge \frac{|R_\ell|}{3 k n^\epsilon} \right] \le \frac{1}{n^3}.$$

This is because the lemma would follow by taking the union bound over all points in $S^*$ (recall that $|S^*| \le
k \le n$). Hence we focus on bounding the probability that the event $|U_\ell \cap S^*(y) \cap R_\ell| \ge \frac{|R_\ell|}{3 k n^\epsilon}$ occurs. The
event implies that none of the $\frac{|R_\ell|}{3 k n^\epsilon}$ closest points in $S^*(y) \cap R_\ell$ from $y$ was added to $S_{\ell+1}$. This is because if
any of such points were added to $S_{\ell+1}$, then all points in $S^*(y) \cap R_\ell$ farther than the point from $y$ would be
satisfied. Hence we have

$$\Pr\left[ |U_\ell \cap S^*(y) \cap R_\ell| \ge \frac{|R_\ell|}{3 k n^\epsilon} \right] \le \left( 1 - \frac{9 k n^\epsilon}{|R_\ell|} \log n \right)^{\frac{|R_\ell|}{3 k n^\epsilon}} \le \frac{1}{n^3}.$$

This completes the proof. □



Recall that we selected a threshold point $v$ to discard the points that are well represented by the current
sample $S$. Let $\mathrm{rank}_{R_\ell}(v)$ denote the number of points $x$ in $R_\ell$ such that the distance from $x$ to $S$ is greater
than the distance from $v$ to $S$. The proof of the following lemma follows easily from the Chernoff inequality.

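Lemma 3.2. Let $S^*$ be any set with no more than $k$ points. Consider any $\ell$-th iteration of the while loop
of Iterative-Sample. Let $v_\ell$ denote the threshold in the current iteration, i.e. the $(8 \log n)$-th farthest
point in $H_\ell$ from $S_{\ell+1}$. Then we have

\[ \Pr\left[\, \frac{|R_\ell|}{n^{\epsilon}} \le \mathrm{rank}_{R_\ell}(v_\ell) \le \frac{4|R_\ell|}{n^{\epsilon}} \,\right] \ge 1 - \frac{2}{n^2}. \]

Proof. Let $r = |R_\ell| / n^{\epsilon}$. Let $N_{\le r}$ denote the number of points in $H_\ell$ that have ranks smaller than $r$, i.e.
$N_{\le r} = |\{x \in H_\ell \mid \mathrm{rank}_{R_\ell}(x) \le r\}|$. Likewise, $N_{\le 4r} = |\{x \in H_\ell \mid \mathrm{rank}_{R_\ell}(x) \le 4r\}|$. Note that
$E[N_{\le r}] = 4 \log n$ and $E[N_{\le 4r}] = 16 \log n$. By the Chernoff inequality, we have $\Pr[N_{\le r} \ge 8 \log n] \le \frac{1}{n^2}$ and
$\Pr[N_{\le 4r} \le 8 \log n] \le \frac{1}{n^2}$. Hence the lemma follows. □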

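Corollary 3.3. Consider any $\ell$-th iteration of the while loop of Iterative-Sample. Then
$\Pr\left[\, \frac{|R_\ell|}{n^{\epsilon}} \le |R_{\ell+1}| \le \frac{4|R_\ell|}{n^{\epsilon}} \,\right] \ge 1 - \frac{2}{n^2}$.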



The above corollary immediately implies Propositions 2.1 and 2.2. Now we show how to map each
unsatisfied point to a satisfied point such that no two unsatisfied points are mapped to the same satisfied
point; that is, the map is injective. Such a mapping will allow us to bound the cost of unsatisfied points by
the cost of the optimal solution. The following theorem is the core of our analysis. The theorem defines a
mapping $p : U \to V$; for each point $x$, we refer to $p(x)$ as the proxy point of $x$.

Theorem 3.4. Consider any set $S^* \subseteq V$. Let $C$ be the set of points returned by Iterative-Sample.
Let $U$ be the set of all points in $V - C$ that are unsatisfied by $C$ with respect to $S^*$. Then w.h.p., there exists
an injective function $p : U \to V \setminus U$ such that, for any $x \in U$, $d(p(x), S^*) \ge d(x, C)$.

Proof. Throughout the analysis, we assume that $|U_\ell| \le |R_\ell| / (3n^{\epsilon})$ and $|R_\ell| / n^{\epsilon} \le |R_{\ell+1}| \le 4|R_\ell| / n^{\epsilon}$ for each
iteration $\ell$. By Lemma 3.1, Corollary 3.3 and a simple union bound, this occurs w.h.p.

Let $\ell_f$ denote the final iteration. Let $A(\ell) := R_{\ell+1} \setminus U_\ell$. Roughly speaking, $A(\ell)$ is a set of candidate
points, to which each $x \in U_\ell \cap D_\ell$ is mapped. Formally, we show the following properties:

1. for any $x \in U_\ell \cap D_\ell$ and any $y \in A(\ell)$, $d(x, S_\ell) \le d(y, S_\ell) \le d(y, S^*)$.

2. $|A(\ell)| \ge \sum_{\ell' \ge \ell} |U_{\ell'}|$.

3. $\bigcup_{\ell=1}^{\ell_f} (U_\ell \cap D_\ell) \supseteq U$.

The first property holds because any point in $R_{\ell+1} = R_\ell - D_\ell \supseteq A(\ell)$ is farther from $S_{\ell+1}$ than any point
in $D_\ell$, by the definition of the algorithm.

The inequality $d(y, S_\ell) \le d(y, S^*)$ is immediate since $y$ is satisfied by $S_\ell$ for $S^*$. The second property
follows since $|A(\ell)| \ge |R_{\ell+1}| - |U_\ell| \ge \frac{2|R_\ell|}{3n^{\epsilon}} \ge \sum_{\ell' \ge \ell} |U_{\ell'}|$. The last inequality holds because
$|U_\ell| \le \frac{|R_\ell|}{3n^{\epsilon}}$ and $|U_\ell|$ decreases by a factor of more than two as $\ell$ grows. We now prove the third property.
The entire set of points $V$ is partitioned into disjoint sets $D_1, D_2, \ldots, D_{\ell_f}$ and $R_{\ell_f+1}$. Further, for any
$1 \le \ell \le \ell_f$, any point in $U \cap D_\ell$ is unsatisfied by $S_\ell$ with respect to $S^*$, thus the point is also in $U_\ell \cap D_\ell$.
Finally, the points in $R_{\ell_f+1} \subseteq C$ are clearly satisfied by $C$.

We now construct $p(\cdot)$ as follows. Starting with $\ell = \ell_f$ down to $\ell = 1$, we map each unsatisfied point
in $U_\ell \cap D_\ell$ to $A(\ell)$ so that no point in $A(\ell)$ is used more than once. This can be done using the second
property. The requirement on $p(\cdot)$ that for any $x \in U$, $d(p(x), S^*) \ge d(x, C)$ is guaranteed by the first
property. Finally, we simply ignore the points in $\bigcup_{\ell=1}^{\ell_f} (U_\ell \cap D_\ell) \setminus U$. This completes the proof. □

3.2 MapReduce-KCenter 

This section is devoted to proving Theorem 1.1. For the sake of analysis, we will consider the following
variant of the $k$-center problem. In the kCenter($V$, $T$) problem, we are given two sets $V$ and $T \subseteq V$
of points in a metric space, and we want to select a subset $S^* \subseteq T$ such that $|S^*| \le k$ and $S^*$ minimizes
$\max_{x \in V} d(x, S)$ among all sets $S \subseteq T$ with at most $k$ points. For notational simplicity, we let
OPT($V$, $T$) denote the optimal solution for the problem kCenter($V$, $T$). Depending on the context,
OPT($V$, $T$) may denote the cost of the solution. Since we are interested eventually in OPT($V$, $V$), we
let OPT := OPT($V$, $V$).

Proposition 3.5. Let $C$ be the set of centers returned by Iterative-Sample. Then w.h.p. we have that
for any $x \in V$, $d(x, C) \le 2\,\mathrm{OPT}$.

Proof. Let $S^* := \mathrm{OPT}$ denote a fixed optimal solution for kCenter($V$, $V$). Let $U$ be the set of all
points that are not satisfied by $C$ with respect to $S^*$. Consider any point $x$ that is satisfied by $C$ with
respect to $S^*$. Since it is satisfied, there exists a point $a \in C$ such that $d(a, x^{S^*}) \le d(x, x^{S^*}) = d(x, S^*)$.
Then by the triangle inequality, we have $d(x, C) \le d(x, a) \le d(x, x^{S^*}) + d(a, x^{S^*}) \le 2d(x, S^*) \le
2 \max_{y \in V} d(y, S^*) = 2\,\mathrm{OPT}$. Now consider any unsatisfied $x$. By Theorem 3.4 we know that w.h.p. there
exists a proxy point $p(x)$ for any unsatisfied point $x \in U$. Then using the property of proxy points, we have
$d(x, C) \le d(p(x), S^*) \le \max_{y \notin U} d(y, S^*) \le \mathrm{OPT}$. □

Proposition 3.6. Let $C$ be the set of centers returned by Iterative-Sample. Then w.h.p. we have
$\mathrm{OPT}(C, C) \le \mathrm{OPT}(V, C) \le 4\,\mathrm{OPT}$.

Proof. Since the first inequality is trivial, we focus on proving the second inequality. Let $S^*$ be an optimal
solution for kCenter($V$, $V$). We construct a set $T \subseteq C$ as follows: for each $x \in S^*$, we add to $T$ the point
in $C$ that is closest to $x$. Note that $|T| \le k$ by construction. For any $x \in V$, we have

\[
\begin{aligned}
d(x, T) &\le d(x, x^{S^*}) + d(x^{S^*}, T) \le d(x, x^{S^*}) + d(x^{S^*}, x^{C}) && \text{[since the closest point in $C$ to $x^{S^*}$ is in $T$]} \\
&\le d(x, x^{S^*}) + d(x, x^{S^*}) + d(x, x^{C}) \\
&= 2d(x, S^*) + d(x, C) \le 2\,\mathrm{OPT} + d(x, C).
\end{aligned}
\]

By Proposition 3.5 we know that w.h.p. for all $x \in V$, $d(x, C) \le 2\,\mathrm{OPT}(V, V)$. Therefore, for all $x \in V$,
$d(x, T) \le 4\,\mathrm{OPT}$. Since $\mathrm{OPT}(V, C) \le \max_{x \in V} d(x, T)$, the second inequality follows. □

Theorem 3.7. If $\mathcal{A}$ is an algorithm that achieves an $\alpha$-approximation for the $k$-center problem, then w.h.p.
the algorithm MapReduce-kCenter achieves a $(4\alpha + 2)$-approximation for the $k$-center problem.

Proof. By Proposition 3.6, $\mathrm{OPT}(C, C) \le 4\,\mathrm{OPT}$. Let $S$ be the set returned by MapReduce-kCenter.
Since $\mathcal{A}$ achieves an $\alpha$-approximation for the $k$-center problem, it follows that

\[ \max_{x \in C} d(x, S) \le \alpha\,\mathrm{OPT}(C, C) \le 4\alpha\,\mathrm{OPT}. \]

Let $x$ be any point. By Proposition 3.5,

\[ d(x, C) \le 2\,\mathrm{OPT}. \]

Therefore

\[ d(x, S) \le d(x^{C}, S) + d(x, x^{C}) \le (4\alpha + 2)\,\mathrm{OPT}. \]

□

By setting the algorithm $\mathcal{A}$ to be the 2-approximation of [17, 19], we complete the proof of Theorem 1.1.

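For concreteness, one standard 2-approximation for $k$-center is the greedy farthest-first traversal. The sketch below is an illustration only: whether it coincides with the algorithm of the cited references is an assumption, and the function names and the Euclidean metric are hypothetical choices. In MapReduce-kCenter it would be run on the sample $C$ rather than on all of $V$.

```python
import math

def farthest_first(points, k):
    """Greedy k-center: repeatedly add the point farthest from the chosen centers."""
    centers = [points[0]]                       # arbitrary first center
    d = [math.dist(p, centers[0]) for p in points]
    while len(centers) < min(k, len(points)):
        i = max(range(len(points)), key=d.__getitem__)   # farthest remaining point
        centers.append(points[i])
        d = [min(d[j], math.dist(points[j], points[i])) for j in range(len(points))]
    return centers

def kcenter_cost(points, centers):
    """The k-center objective: maximum distance from a point to its closest center."""
    return max(min(math.dist(p, c) for c in centers) for p in points)
```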
3.3 MapReduce-KMedian 

In the following, we will consider variants of the $k$-median problem similar to the variant of
the $k$-center problem considered in the previous section. In the kMedian($V$, $T$) problem, we are given two
sets $V$ and $T \subseteq V$ of points in a metric space, and we want to select a subset $S^* \subseteq T$ such that $|S^*| \le k$
and $S^*$ minimizes $\sum_{x \in V} d(x, S)$ among all sets $S \subseteq T$ with at most $k$ points. We let OPT($V$, $T$) denote
a fixed optimal solution for kMedian($V$, $T$) or the optimal cost depending on the context. Note that we
are interested in obtaining a solution that is comparable to OPT($V$, $V$). Hence, for notational simplicity,
we let OPT := OPT($V$, $V$). In the Weighted-kMedian($V$, $w$) problem, we are given a set $V$ of points
in a metric space such that each point $x$ has a weight $w(x)$, and we want to select a subset $S^* \subseteq V$ such
that $|S^*| \le k$ and $S^*$ minimizes $\sum_{x \in V} w(x)\,d(x, S)$ among all sets $S \subseteq V$ with at most $k$ points. Let
$\mathrm{OPT}^{w}(V, w)$ denote a fixed optimal solution for Weighted-kMedian($V$, $w$).

Recall that MapReduce-kMedian computes an approximate $k$-medians on $C$ with each point $x$ in $C$
having a weight $w(x)$. Hence we first show that we can obtain a good approximate $k$-medians using only
the points in $C$.

Proposition 3.8. Let $S^* := \mathrm{OPT}$. Let $C$ be the set of centers returned by Iterative-Sample. Then
w.h.p., we have that $\sum_{x \in V} d(x, C) \le 3\,\mathrm{OPT}$.

Proof. Let $U$ denote the set of points that are unsatisfied by $C$ with respect to $S^*$. By Theorem 3.4,
w.h.p. there exist proxy points $p(x)$ for all unsatisfied points. First consider any satisfied point $x \notin U$.
It follows that there exists a point $a \in C$ such that $d(a, x^{S^*}) \le d(x, x^{S^*}) = d(x, S^*)$. By the triangle
inequality, $d(x, C) \le d(x, a) \le d(x, x^{S^*}) + d(a, x^{S^*}) \le 2d(x, S^*)$. Hence $\sum_{x \notin U} d(x, C) \le 2\,\mathrm{OPT}$. We
now turn to the unsatisfied points: $\sum_{x \in U} d(x, C) \le \sum_{x \in U} d(p(x), S^*) \le \mathrm{OPT}$. The last inequality is
due to the property that $p(\cdot)$ is injective. □

Proposition 3.9. Let $C$ be the set returned by Iterative-Sample. Then w.h.p., $\mathrm{OPT}(V, C) \le 5\,\mathrm{OPT}$.

Proof. Let $S^*$ be an optimal solution for kMedian($V$, $V$). We construct a set $T \subseteq C$ as follows: for each
$x \in S^*$, we add to $T$ the point in $C$ that is closest to $x$. By construction $|T| \le k$. For any $x$, we have

\[
\begin{aligned}
d(x, T) &\le d(x, x^{S^*}) + d(x^{S^*}, T) \le d(x, x^{S^*}) + d(x^{S^*}, x^{C}) && \text{[the closest point in $C$ to $x^{S^*}$ is in $T$]} \\
&\le d(x, x^{S^*}) + d(x, x^{S^*}) + d(x, x^{C}) = 2d(x, S^*) + d(x, C).
\end{aligned}
\]

By applying Proposition 3.8, w.h.p. we have $\sum_{x \in V} d(x, T) \le 2 \sum_{x \in V} d(x, S^*) + \sum_{x \in V} d(x, C) \le 5\,\mathrm{OPT}$.
Since $T$ is a feasible solution for kMedian($V$, $C$), it follows that $\mathrm{OPT}(V, C) \le 5\,\mathrm{OPT}$. □
So far we have shown that we can obtain a good approximate solution for kMedian($V$, $V$) even
when we are restricted to $C$. However, we need a stronger argument, since MapReduce-kMedian only
sees the weighted points in $C$ and not the entire point set $V$.

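Proposition 3.10. Consider any subset of points $C \subseteq V$. For each point $y \in C$, let $w(y) = |\{x \in
V - C \mid x^{C} = y\}| + 1$. Then we have $\mathrm{OPT}^{w}(C, w) \le 2\,\mathrm{OPT}(V, C)$.

Proof. Let $T^* := \mathrm{OPT}(V, C)$. Let $\bar{C} := V \setminus C$. For each point $x \in \bar{C}$, we have $d(x, x^{C}) + d(x, x^{T^*}) \ge
d(x^{C}, x^{T^*}) \ge d(x^{C}, T^*)$. Therefore $\sum_{x \in \bar{C}} d(x, T^*) \ge \sum_{x \in \bar{C}} \left( d(x^{C}, T^*) - d(x, x^{C}) \right)$. Further we have,

\[
\begin{aligned}
2 \sum_{x \in \bar{C}} d(x, T^*) &\ge \sum_{x \in \bar{C}} d(x^{C}, T^*) + \sum_{x \in \bar{C}} \left( d(x, T^*) - d(x, x^{C}) \right) \\
&\ge \sum_{x \in \bar{C}} d(x^{C}, T^*) && \text{[$d(x, T^*) \ge d(x, C)$, since $T^* \subseteq C$]} \\
&= \sum_{y \in C} \; \sum_{x \in \bar{C} :\, x^{C} = y} d(y, T^*) = \sum_{y \in C} (w(y) - 1)\, d(y, T^*).
\end{aligned}
\]

Hence we have $2\,\mathrm{OPT}(V, C) = 2 \sum_{x \in V} d(x, T^*) \ge \sum_{y \in C} w(y)\, d(y, T^*)$. Since $T^*$ is a feasible solution for
Weighted-kMedian($C$, $w$), it follows that $\mathrm{OPT}^{w}(C, w) \le 2\,\mathrm{OPT}(V, C)$. □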
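The weights $w(y)$ in Proposition 3.10 can be computed in a single pass over $V$. The following is a minimal sketch, assuming points are distinct hashable tuples and the metric is Euclidean; `sample_weights` is a hypothetical name, and ties for the closest sample point are broken by order in `C`.

```python
import math
from collections import Counter

def sample_weights(V, C):
    """w(y) = |{x in V - C : x^C = y}| + 1 for each y in C."""
    in_C = set(C)
    w = Counter({y: 1 for y in C})              # each sample point counts itself
    for x in V:
        if x not in in_C:
            nearest = min(C, key=lambda y: math.dist(x, y))   # x^C, ties: first in C
            w[nearest] += 1
    return dict(w)
```

Since every point of $V$ is counted exactly once, the weights always sum to $|V|$.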

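Theorem 3.11. If $\mathcal{A}$ is an algorithm that achieves an $\alpha$-approximation for Weighted-kMedian, w.h.p.
the algorithm MapReduce-kMedian achieves a $(10\alpha + 3)$-approximation for kMedian.

Proof. It follows from Proposition 3.9 and Proposition 3.10 that w.h.p., $\mathrm{OPT}^{w}(C, w) \le 10\,\mathrm{OPT}$.

Let $S$ be the set returned by MapReduce-kMedian. Since $\mathcal{A}$ achieves an $\alpha$-approximation for
Weighted-kMedian, it follows that

\[ \sum_{y \in C} w(y)\, d(y, S) \le \alpha\,\mathrm{OPT}^{w}(C, w) \le 10\alpha\,\mathrm{OPT}. \]

Writing $\bar{C} := V \setminus C$, we have

\[
\begin{aligned}
\sum_{x \in V} d(x, S) &= \sum_{y \in C} d(y, S) + \sum_{x \in \bar{C}} d(x, S) \\
&\le \sum_{y \in C} d(y, S) + \sum_{x \in \bar{C}} \left( d(x, x^{C}) + d(x^{C}, S) \right) \\
&= \sum_{y \in C} w(y)\, d(y, S) + \sum_{x \in \bar{C}} d(x, C).
\end{aligned}
\]

By Proposition 3.8 (with $S^*$ equal to OPT($V$, $V$)), we get that

\[ \sum_{x \in V} d(x, C) \le 3 \sum_{x \in V} d(x, S^*) = 3\,\mathrm{OPT}. \]

Therefore

\[ \sum_{x \in V} d(x, S) \le (10\alpha + 3)\,\mathrm{OPT}. \]

□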
Figure 1: The relative cost and running time of clustering algorithms when the number of points is not too
large. The costs are normalized to that of Parallel-Lloyd. The running time is given in seconds.

Figure 2: The relative cost and running time of the scalable algorithms when the number of points is large.
The costs are normalized to that of Parallel-Lloyd. The running time is given in seconds.

Recall that there is a $(3 + 2/c)$-approximation algorithm for $k$-median that runs in $O(n^c)$ time [4, 21]. In
order to complete the proof of Theorem 1.2, we pick a constant $c$ and we use the $(3 + 2/c)$-approximation
algorithm.
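
As a quick check of the arithmetic, substituting $\alpha = 3 + 2/c$ into the bound of Theorem 3.11 gives the factor below. This is only a sketch: it assumes the $(3 + 2/c)$-approximation can serve as the algorithm $\mathcal{A}$ for Weighted-kMedian, and the value $c = 2$ is chosen purely for illustration.

\[ 10\alpha + 3 = 10\left(3 + \frac{2}{c}\right) + 3 = 33 + \frac{20}{c}, \qquad \text{e.g. } c = 2 \;\Rightarrow\; 10 \cdot 4 + 3 = 43. \]

The factor decreases towards 33 as $c$ grows, at the price of the $O(n^c)$ running time of the local search.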

4 Experiments 

In this section we give an experimental study of the algorithms introduced in this paper. The focus for
this section is on the $k$-median objective because this is where our algorithm gives the largest increase in
performance. Unfortunately, our sampling algorithm does not perform well for the $k$-center objective. This
is because the $k$-center objective is quite sensitive to sampling. Since the maximum distance from a point
to a center is considered in the objective, if the sampling algorithm misses even one important point then
the objective can substantially increase. From now on, we only consider the $k$-median problem. In the
following, we describe the algorithms we tested and we give an overview of the experiments and the results.






4.1 Implemented Algorithms 

We compare our algorithm MapReduce-kMedian to several algorithms. Recall that 
MapReduce-kMedian uses Iterative-Sample as a sub-procedure and we have shown that 
MapReduce-kMedian gives a constant approximation when the local search algorithm [4, 21] is
applied on the sample that was obtained by Iterative-Sample. We also consider Lloyd's algorithm 
together with the sampling procedure Iterative-Sample; that is, in MapReduce-kMedian, the 
algorithm A is Lloyd's algorithm and it takes as input the sample constructed by Iterative-Sample. 
Note that Lloyd's algorithm does not give an approximation guarantee. However, it is the most pop- 
ular algorithm for clustering in practice and therefore it is worth testing its performance. We will use 
Sampling-LocalSearch to refer to MapReduce-kMedian with the local search algorithm as A and 
we will use Sampling-Lloyd to refer to MapReduce-kMedian with Lloyd's algorithm as A. Note 
that the only difference between Sampling-LocalSearch and Sampling-Lloyd is the clustering 
algorithm chosen as A in MapReduce-kMedian. 

We also implement the local search algorithm and Lloyd's algorithm without sampling. The local search 
algorithm, denoted as LocalSearch, is the only sequential algorithm among all algorithms that we
implemented [4, 21]. We implement a parallelized version of Lloyd's algorithm, Parallel-Lloyd [28, 12, 1].
This implementation of Lloyd's algorithm parallelizes a sub-procedure of the sequential Lloyd's algorithm. 
The parallel version of Lloyd's gives the same solution as the sequential version of Lloyd's; the only dif- 
ference between the two implementations is the parallelization. We give a more formal description of the 
parallel Lloyd's algorithm below. 

Finally, we implement clustering algorithms based on a simple partitioning scheme used to adapt
sequential algorithms to the parallel setting. In the partition scheme MapReduce-Divide-kMedian
we consider, points are partitioned into $\ell$ sets of size $\Theta(n/\ell)$. In parallel, centers are computed for each
of the partitions. Then all of the centers computed are combined into a single set and the centers are
clustered. We formalize this in the algorithm MapReduce-Divide-kMedian. We evaluated the local
search algorithm and Lloyd's algorithm coupled with this partition scheme. Throughout this section, we use
Divide-LocalSearch for the local search together with this partition scheme. We call Lloyd's algorithm
coupled with the partition scheme Divide-Lloyd. We give the details of the partition framework
MapReduce-Divide-kMedian shortly.

The following is a summary of the algorithms we implemented: 

• LocalSearch: Local Search 

• Parallel-Lloyd: Parallel Lloyd's 

• Sampling-LocalSearch: Sampling and Local Search 

• Sampling-Lloyd: Sampling and Lloyd's 

• Divide-LocalSearch: Partition and Local Search 

• Divide-Lloyd: Partition and Lloyd's 

A careful reader may note that Lloyd's algorithm is generally used for the $k$-means objective and not for
$k$-median. Lloyd's algorithm is more commonly used for $k$-means, but it can be used for $k$-median as well,
and it is one of the most popular clustering algorithms in practice. We note that the parallelized version of
Lloyd's algorithm we introduce only works with points in Euclidean space.

Parallel Lloyd's Algorithm: We give a sketch of parallelized implementation of Lloyd's algorithm used 
in the experiments. More details can be found in BUI. The algorithm begins by partitioning the points 
evenly across the machines and these points will remain on the machines. The algorithm initializes the $k$
centers to an arbitrary set of points. In each iteration, the algorithm improves the centers as follows. The
mapper sends the $k$ centers to each of the machines. On each machine, the reducer clusters the points on
the machine by assigning each point to its closest center. For each cluster, the average⁴ of the points in
the cluster is computed along with the number of points assigned to the center. The mappers map all this
information to a single machine. For each center, the mappers aggregate the points assigned to the center
over all partitions along with the centers, and then the reducers update the center to be the average of these
points. It is important to note that the solution computed by the algorithm is the same as that of the sequential
version of Lloyd's algorithm.
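
One iteration of this scheme can be simulated on a single machine as follows. This is a sketch under stated assumptions rather than the implementation used in the experiments: each simulated machine reports per-center coordinate sums and counts (equivalent to reporting averages and counts), a center that receives no points is left unchanged, and the names `local_stats` and `lloyd_round` are hypothetical.

```python
import math

def local_stats(points, centers):
    """Per-machine step: coordinate sums and counts for each center's cluster."""
    dim = len(centers[0])
    sums = [[0.0] * dim for _ in centers]
    counts = [0] * len(centers)
    for p in points:
        j = min(range(len(centers)), key=lambda c: math.dist(p, centers[c]))
        counts[j] += 1
        for t in range(dim):
            sums[j][t] += p[t]
    return sums, counts

def lloyd_round(partitions, centers):
    """One iteration: combine the per-machine statistics and recompute the centers."""
    dim = len(centers[0])
    tot = [[0.0] * dim for _ in centers]
    cnt = [0] * len(centers)
    for part in partitions:                     # each part stays on one simulated machine
        sums, counts = local_stats(part, centers)
        for j in range(len(centers)):
            cnt[j] += counts[j]
            for t in range(dim):
                tot[j][t] += sums[j][t]
    # Assumption: a center with no assigned points keeps its old position.
    return [tuple(s / cnt[j] for s in tot[j]) if cnt[j] else tuple(centers[j])
            for j in range(len(centers))]
```

Because only sums and counts are combined, the updated centers do not depend on how the points are split across machines, which is why the result matches the sequential algorithm.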

Partitioning Based Scheme: We describe the partition scheme MapReduce-Divide-kMedian 
that is used for the Divide-LocalSearch and Divide-Lloyd algorithms. The algorithm 
MapReduce-Divide-kMedian is a partitioning-based parallelization of any arbitrary sequential
clustering algorithm. We note that this algorithm and the following analysis have also been considered by Guha
et al. [20] in the streaming model.

Algorithm 6 MapReduce-Divide-kMedian($V$, $E$, $k$, $\ell$):
1: Let $n = |V|$.
2: The mappers arbitrarily partition $V$ into disjoint sets $S_1, \ldots, S_\ell$, each of size $\Theta(n/\ell)$.
3: for $i = 1$ to $\ell$ do
4:     The mapper assigns $S_i$ and all the distances between points in $S_i$ to reducer $i$.
5:     Reducer $i$ runs a $k$-median clustering algorithm $\mathcal{A}$ with $(S_i, k)$ as input to find a set $C_i \subseteq S_i$ of $k$
       centers.
6:     Reducer $i$ computes, for each $y \in C_i$, $w(y) = |\{x \in S_i \setminus C_i \mid d(x, y) = d(x, C_i)\}| + 1$.
7: end for
8: Let $C = \bigcup_{i=1}^{\ell} C_i$.
9: The mapper sends $C$, the pairwise distances between points in $C$ and the numbers $w(\cdot)$ to a single
   reducer.
10: The reducer runs a Weighted-kMedian algorithm $\mathcal{A}$ with $(C, w, k)$ as input.
11: Return the set constructed by $\mathcal{A}$.
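
The steps of Algorithm 6 can be simulated sequentially as in the sketch below. It assumes distinct points given as hashable tuples and a Euclidean metric, takes the two clustering routines as caller-supplied functions `A(points, k)` and `weighted_A(points, weights, k)` (both hypothetical signatures), and assigns each point to a single closest center when there is a tie.

```python
import math

def closest(x, S):
    """Return the point of S closest to x."""
    return min(S, key=lambda s: math.dist(x, s))

def divide_kmedian(V, k, ell, A, weighted_A):
    """Sequential simulation of MapReduce-Divide-kMedian."""
    parts = [V[i::ell] for i in range(ell)]     # step 2: arbitrary partition into ell sets
    C, w = [], {}
    for S_i in parts:                           # steps 3-7: one simulated reducer per part
        if not S_i:
            continue
        C_i = A(S_i, k)                         # step 5: cluster the part
        for y in C_i:
            w[y] = 1                            # step 6: each center counts itself
        for x in S_i:
            if x not in C_i:
                w[closest(x, C_i)] += 1         # ties go to the first closest center
        C.extend(C_i)                           # step 8
    return weighted_A(C, w, k)                  # steps 9-11: cluster the weighted centers
```

Any sequential solver can be plugged in for `A`, for example a local search or Lloyd's algorithm, which corresponds to Divide-LocalSearch and Divide-Lloyd respectively.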



It is straightforward to verify that setting $\ell = \sqrt{n/k}$ minimizes the maximum memory needed on
a machine; in the following, we assume that $\ell = \sqrt{n/k}$. The total memory used by the algorithm is
$O(kn \log n)$. (Recall that we assume that the distance between two points can be represented using $O(\log n)$
bits.) Additionally, the memory needed is also $\Omega(kn)$, since in Step 9, $\Theta(\sqrt{n/k})$ sets of $k$ points are sent to
a single machine along with their pairwise distances. The following proposition follows from the algorithm
description.

Proposition 4.1. MapReduce-Divide-kMedian runs in $O(1)$ MapReduce rounds.
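
The choice of $\ell$ made before the proposition can be recovered with a short balancing argument. The sketch below assumes that a machine's memory is dominated by the pairwise distances it stores (steps 4 and 9 of Algorithm 6), and it ignores constant factors and the $O(\log n)$ bits per distance.

\[
\underbrace{\left(\frac{n}{\ell}\right)^{2}}_{\text{reducer } i \text{ in step 4}}
\quad \text{versus} \quad
\underbrace{(k\ell)^{2}}_{\text{single reducer in step 9}}
\qquad \Longrightarrow \qquad
\frac{n}{\ell} = k\ell \iff \ell = \sqrt{n/k}.
\]

The first term decreases in $\ell$ and the second increases, so their maximum is smallest where they meet; at that point both equal $nk$, in line with the $\Omega(kn)$ bound stated in the text.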

From the analysis given in [20], we have the following theorem which can be used to bound the
approximation factor of MapReduce-Divide-kMedian.

Theorem 4.2 (Theorem 2.2 in [20]). Consider any set of $n$ points arbitrarily partitioned into disjoint sets
$S_1, \ldots, S_\ell$. The sum of the optimum solution values for the $k$-median problem on the $\ell$ sets of points is at
most twice the cost of the optimum $k$-median problem solution for all $n$ points, for any $\ell > 0$.

4 Recall that the input to Lloyd's algorithm is a set of points in Euclidean space. The average of the points is the point in 
Euclidean space whose coordinates are the average of the coordinates of the points. 
Corollary 4.3 ([20]). If the algorithm $\mathcal{A}$ achieves an $\alpha$-approximation for the $k$-median problem, the
algorithm MapReduce-Divide-kMedian achieves a $3\alpha$-approximation for the $k$-median problem.

By this Corollary, note that Divide-LocalSearch is a constant factor approximation. 

4.2 Experiment Overview 

We generate a random set of points in $\mathbb{R}^3$. Our data set consists of $k$ centers and randomly generated points
around the centers to create clusters. The $k$ centers are randomly positioned in a unit cube. The number of
points generated within a cluster is sampled from a Zipf distribution. More precisely, let $\{C_i\}_{1 \le i \le k}$ be the
set of clusters. Given a fixed number of points, each point is assigned to the cluster $C_i$ with probability
$i^{\alpha} / \sum_{j=1}^{k} j^{\alpha}$, where $\alpha$ is the parameter of the Zipf distribution. Notice that when $\alpha = 0$, all clusters will
have almost the same size and, as $\alpha$ grows, the sizes of the clusters become more non-uniform. The distance
between a point and its center is sampled from a normal distribution with a fixed global standard deviation
$\sigma$. Each experiment with the same parameter set was repeated three times and the average was calculated.
When running the local search or Lloyd's algorithm, the seed centers were chosen arbitrarily.
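
A generator along these lines might look as follows. It is a sketch under several assumptions that the text does not specify: centers are drawn uniformly from the unit cube, the direction from a center to a point is uniform on the sphere, and the absolute value of the normal sample is used as the distance. The function name and the `seed` parameter are hypothetical.

```python
import math
import random

def generate_points(n, k, alpha, sigma, seed=0):
    """Return (centers, points): k centers in the unit cube and n clustered points."""
    rng = random.Random(seed)
    centers = [tuple(rng.random() for _ in range(3)) for _ in range(k)]
    weights = [(i + 1) ** alpha for i in range(k)]    # cluster i has weight i^alpha
    points = []
    for _ in range(n):
        c = rng.choices(centers, weights=weights)[0]  # pick a cluster
        r = abs(rng.gauss(0.0, sigma))                # distance to the center (assumed |N(0, sigma)|)
        g = [rng.gauss(0.0, 1.0) for _ in range(3)]   # random direction (assumed uniform)
        norm = math.sqrt(sum(v * v for v in g)) or 1.0
        points.append(tuple(c[t] + r * g[t] / norm for t in range(3)))
    return centers, points
```

For example, `generate_points(10**4, 25, 0.0, 0.1)` would produce 25 clusters of roughly equal size.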

All experiments were performed on a single machine. When running MapReduce algorithms, we simu- 
lated each machine used by the algorithm. For a given round, we recorded the time it takes for the machine 
that ran the longest in the round. Then we summed this time over all the rounds to get the final running 
time of the parallel algorithms. In these experiments, the communication cost was ignored. More precisely, 
we ignored the time needed to move data to a different machine. The specifications of the machine were 
Intel(R) Core(TM) i7 CPU 870 @ 2.93GHz and the memory available was 8GB. We used the standard 
clock() function to measure the time for each experiment. All parallel algorithms were simulated
assuming that there are 100 machines. For the algorithm MapReduce-kMedian the value of $\epsilon$ was set to 0.1 for
the sampling probability.
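
The timing rule described above (the slowest machine in each round, summed over rounds, with communication ignored) amounts to the following computation. The input layout and the function name are assumptions made for illustration.

```python
def simulated_running_time(round_times):
    """round_times[r][m] is the time machine m spent in round r, in seconds.

    Returns the sum over rounds of the slowest machine's time; data movement
    between machines is not counted.
    """
    return sum(max(machines) for machines in round_times)
```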

4.3 Results 

Because of the space constraints, we only give a brief summary of our results. The data can be found in 
Figures 1 and 2. For the data in the figures, the number of points is the only variable, and other parameters
are fixed: a = 0.1, a = and k = 25. The cost of the algorithms' objectives is normalized to the cost of 
Parallel-Lloyd in the figures. Figure 1 summarizes the results of the experiments on data sets with
at most $10^6$ points, and Figure 2 summarizes the results of the experiments on data sets with at most $10^7$
points.

Our experiments show that Sampling-Lloyd and Sampling-LocalSearch achieve a significant
speedup over Parallel-Lloyd (about 20x), a speedup of more than ten times over
Divide-LocalSearch and a significant speedup over LocalSearch (over 1000x) as seen in
Figure 1. The speedup increases very fast as the number of points increases. Further, this speedup is achieved
with negligible loss in performance; our algorithm's objective performs close to the Parallel-Lloyd
and LocalSearch when the number of points is sufficiently large.

Finally, we compare the performance of Sampling-LocalSearch and Sampling-Lloyd with
the performance of Divide-Lloyd on the largest data sets; the results are summarized in Figure 2.
These algorithms were chosen because they are the most scalable and perform well; as shown in Figure 1,
LocalSearch is far from scalable. Although Divide-LocalSearch's running times are similar to
Parallel-Lloyd's, we were not able to run additional experiments with Divide-LocalSearch
because it takes a very long time to simulate on a single machine. These additional experiments show that,
for data sets consisting of $5 \times 10^6$ points, the running time of Sampling-LocalSearch is slightly
larger than Divide-Lloyd's and the clustering cost of Sampling-LocalSearch is similar to the
cost of Divide-Lloyd. The algorithm Sampling-Lloyd achieves a speedup of about 25% over
Divide-Lloyd when the number of points is $10^7$. Overall the experiments show that, when coupled
with Lloyd's algorithm, our sampling algorithm runs faster than any previously known algorithm that we
considered, and this speedup is achieved at a very small loss in performance. We also ran experiments with
different settings for the parameters $\alpha$, $k$, and $\sigma$, and the results were similar; we omit these results from this
version of the paper.

5 Conclusion 

In this paper we give the first approximation algorithms for the $k$-center and $k$-median problems that run in
a constant number of MapReduce rounds. We note that we have preliminary evidence that the analysis used
for the $k$-median problem can be extended to the $k$-means problem in Euclidean space; for this problem, our
analysis also gives a MapReduce algorithm that runs in a constant number of rounds and achieves a constant
factor approximation.